This is an exploration of the diamonds dataset, which is embedded in the ggplot library.
library(tidyverse)
## ── Attaching packages ───────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
## ✔ tibble 1.4.2 ✔ dplyr 0.7.4
## ✔ tidyr 0.7.2 ✔ stringr 1.2.0
## ✔ readr 1.1.1 ✔ forcats 0.2.0
## ── Conflicts ──────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
summary(diamonds)
## carat cut color clarity
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066
## Max. :5.0100 I: 5422 VVS1 : 3655
## J: 2808 (Other): 2531
## depth table price x
## Min. :43.00 Min. :43.00 Min. : 326 Min. : 0.000
## 1st Qu.:61.00 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710
## Median :61.80 Median :57.00 Median : 2401 Median : 5.700
## Mean :61.75 Mean :57.46 Mean : 3933 Mean : 5.731
## 3rd Qu.:62.50 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540
## Max. :79.00 Max. :95.00 Max. :18823 Max. :10.740
##
## y z
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 4.720 1st Qu.: 2.910
## Median : 5.710 Median : 3.530
## Mean : 5.735 Mean : 3.539
## 3rd Qu.: 6.540 3rd Qu.: 4.040
## Max. :58.900 Max. :31.800
##
length(diamonds$carat)
## [1] 53940
length(diamonds)
## [1] 10
there are 53940 observations of 10 variables. There are three ordered factors (cut, color, clarity) with the best color being “D”
a price histogram
ggplot(data=diamonds,aes(x=price))+
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
mean(diamonds$price)
## [1] 3932.8
median(diamonds$price)
## [1] 2401
The distribution has a long right tail with a mean of $3932.80 and a median of $2401.
How many diamonds cost less than $500? How many diamonds cost less than $250? How many diamnods cost $15,000 or more?
sum(diamonds$price<500)
## [1] 1729
sum(diamonds$price<250)
## [1] 0
sum(diamonds$price>=15000)
## [1] 1656
Try limiting the x-axis, altering the bin width, and setting different breaks on the x-axis.
There won’t be a solution video for this question so go to the discussions to share your thoughts and discover what other people find.
You can save images by using the ggsave() command. ggsave() will save the last plot created. For example… qplot(x = price, data = diamonds) ggsave(‘priceHistogram.png’)
ggplot(data=diamonds,aes(x=price))+
geom_histogram(binwidth=50)+
coord_cartesian(xlim=c(250,2500))
## Break out histograms by cut
ggplot(data=diamonds,aes(x=price))+
geom_histogram(binwidth=50)+
facet_wrap(~cut)
Which has the highest priced diamond? Which has the lowest priced diamond? Which has the lowest median price?
by(diamonds$price,diamonds$cut,max)
## diamonds$cut: Fair
## [1] 18574
## --------------------------------------------------------
## diamonds$cut: Good
## [1] 18788
## --------------------------------------------------------
## diamonds$cut: Very Good
## [1] 18818
## --------------------------------------------------------
## diamonds$cut: Premium
## [1] 18823
## --------------------------------------------------------
## diamonds$cut: Ideal
## [1] 18806
by(diamonds$price,diamonds$cut,min)
## diamonds$cut: Fair
## [1] 337
## --------------------------------------------------------
## diamonds$cut: Good
## [1] 327
## --------------------------------------------------------
## diamonds$cut: Very Good
## [1] 336
## --------------------------------------------------------
## diamonds$cut: Premium
## [1] 326
## --------------------------------------------------------
## diamonds$cut: Ideal
## [1] 326
by(diamonds$price,diamonds$cut,median)
## diamonds$cut: Fair
## [1] 3282
## --------------------------------------------------------
## diamonds$cut: Good
## [1] 3050.5
## --------------------------------------------------------
## diamonds$cut: Very Good
## [1] 2648
## --------------------------------------------------------
## diamonds$cut: Premium
## [1] 3185
## --------------------------------------------------------
## diamonds$cut: Ideal
## [1] 1810
Which has the highest priced diamond? premium Which has the lowest priced diamond? Premium and Ideal Which has the lowest median price? Ideal
In the last exercise, we looked at the summary statistics for diamond price by cut. If we look at the output table, the the median and quartiles are reasonably close to each other.
This means the distributions should be somewhat similar, but the histograms we created don’t show that.
The ‘Fair’ and ‘Good’ diamonds appear to have different distributions compared to the better cut diamonds. They seem somewhat uniform on the left with long tails on the right.
Let’s look in to this more.
Look up the documentation for facet_wrap in R Studio. Then, scroll back up and add a parameter to facet_wrap so that the y-axis in the histograms is not fixed. You want the y-axis to be different for each histogram.
ggplot(data=diamonds,aes(x=price))+
geom_histogram(binwidth=50)+
facet_wrap(~cut,scales='free')
ggplot(data=diamonds,aes(x=price/carat))+
geom_histogram()+
scale_x_log10()+
facet_wrap(~cut,scales='free')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data=diamonds,aes(x=color,y=price/carat))+
geom_boxplot()
by(diamonds$price,diamonds$color,summary)
## diamonds$color: D
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 357 911 1838 3170 4214 18693
## --------------------------------------------------------
## diamonds$color: E
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 326 882 1739 3077 4003 18731
## --------------------------------------------------------
## diamonds$color: F
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 342 982 2344 3725 4868 18791
## --------------------------------------------------------
## diamonds$color: G
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 354 931 2242 3999 6048 18818
## --------------------------------------------------------
## diamonds$color: H
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 337 984 3460 4487 5980 18803
## --------------------------------------------------------
## diamonds$color: I
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1120 3730 5092 7202 18823
## --------------------------------------------------------
## diamonds$color: J
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 335 1860 4234 5324 7695 18710
summary(subset(diamonds,color=='D')$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 357 911 1838 3170 4214 18693
summary(subset(diamonds,color=='J')$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 335 1860 4234 5324 7695 18710
IQR(subset(diamonds,color=='D')$price)
## [1] 3302.5
IQR(subset(diamonds,color=='J')$price)
## [1] 5834.5
ggplot(data=diamonds,aes(x=price/carat))+
geom_histogram()+
scale_x_log10()+
facet_wrap(~color,scales='free',nrow=3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data=diamonds,aes(x=carat))+
geom_freqpoly(binwidth=.01)
ggplot(data=diamonds,aes(x=x,y=price))+
geom_point()
What is going on? there is some exponential trend
with(diamonds,cor.test(x,price))
##
## Pearson's product-moment correlation
##
## data: x and price
## t = 440.16, df = 53938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8825835 0.8862594
## sample estimates:
## cor
## 0.8844352
with(diamonds,cor.test(y,price))
##
## Pearson's product-moment correlation
##
## data: y and price
## t = 401.14, df = 53938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8632867 0.8675241
## sample estimates:
## cor
## 0.8654209
with(diamonds,cor.test(z,price))
##
## Pearson's product-moment correlation
##
## data: z and price
## t = 393.6, df = 53938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8590541 0.8634131
## sample estimates:
## cor
## 0.8612494
ggplot(data=diamonds,aes(x=depth,y=price))+
geom_point(alpha=1/100)+
scale_x_continuous(breaks=seq(2,80,2))
typical depth range is between 58 and 64.
with(diamonds,cor.test(depth,price))
##
## Pearson's product-moment correlation
##
## data: depth and price
## t = -2.473, df = 53938, p-value = 0.0134
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.019084756 -0.002208537
## sample estimates:
## cor
## -0.0106474
ggplot(data=subset(diamonds,price<quantile(price,.99) & carat<quantile(carat,.99)),
aes(x=carat,y=price))+
geom_point()
diamonds%>%
mutate(volume=x*y*z)->diamonds
ggplot(data=diamonds,aes(x=volume,y=price))+
geom_point()
How many diamonds have 0 volume??
sum(diamonds$volume==0)
## [1] 20
with(subset(diamonds,volume>0 & volume<=800),cor.test(price,volume))
##
## Pearson's product-moment correlation
##
## data: price and volume
## t = 559.19, df = 53915, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9222944 0.9247772
## sample estimates:
## cor
## 0.9235455
We encourage you to think about this next question and to post your thoughts in the discussion section.
Do you think this would be a useful model to estimate the price of diamonds? Why or why not?
ggplot(data=subset(diamonds,volume>0 & volume<=800),
aes(x=volume,y=price))+
geom_point(alpha=1/50)+
geom_smooth()
## `geom_smooth()` using method = 'gam'
This does not seem like a great fit because it shows that at some point, diamonds get cheaper per volume, which is very false.
Use the function dplyr package to create a new data frame containing info on diamonds by clarity.
Name the data frame diamondsByClarity
diamondsByClarity<-diamonds%>%
group_by(clarity)%>%
summarise(mean_price=mean(price),
median_price=median(price),
min_price=min(price),
max_price=max(price),
n=n())
head(diamondsByClarity)
## # A tibble: 6 x 6
## clarity mean_price median_price min_price max_price n
## <ord> <dbl> <dbl> <dbl> <dbl> <int>
## 1 I1 3924 3344 345 18531 741
## 2 SI2 5063 4072 326 18804 9194
## 3 SI1 3996 2822 326 18818 13065
## 4 VS2 3925 2054 334 18823 12258
## 5 VS1 3839 2005 327 18795 8171
## 6 VVS2 3284 1311 336 18768 5066
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
diamonds_by_clarity <- group_by(diamonds, clarity)
diamonds_mp_by_clarity <- summarise(diamonds_by_clarity, mean_price = mean(price))
diamonds_by_color <- group_by(diamonds, color)
diamonds_mp_by_color <- summarise(diamonds_by_color, mean_price = mean(price))
diamonds_mp_by_clarity
## # A tibble: 8 x 2
## clarity mean_price
## <ord> <dbl>
## 1 I1 3924
## 2 SI2 5063
## 3 SI1 3996
## 4 VS2 3925
## 5 VS1 3839
## 6 VVS2 3284
## 7 VVS1 2523
## 8 IF 2865
dclar<-ggplot(data=diamonds_mp_by_clarity,aes(x=clarity,y=mean_price))+
geom_col()
dcol<-ggplot(data=diamonds_mp_by_color,aes(x=color,y=mean_price))+
geom_col()
grid.arrange(dclar,dcol,ncol=1)
Facet the histogram by diamond color and use cut to color the histogram bars.
ggplot(data=diamonds,aes(x=price,fill=cut))+
geom_histogram(bins=50)+
scale_x_log10()+
facet_wrap(~color)+
scale_fill_brewer(type='qual')
ggplot(data=diamonds,aes(y=price,x=table,color=cut))+
geom_point()+
scale_color_brewer(type='qual')
What is the typical table range for diamonds of ideal cut?
53-57
What is the typical table range for diamonds of premium cut?
58-62
Use scale on the y-axis to take the log10 of price. You should also omit the top 1% of diamond volumes from the plot.
ggplot(data=subset(diamonds,volume<quantile(volume,.99)&
volume>0),
aes(x=volume,y=price,color=clarity))+
geom_point()+
scale_y_log10()+
scale_color_brewer(type='div')
ggplot(data=diamonds,aes(x=cut,y=price/carat,color=color))+
geom_jitter(size=1)+
facet_wrap(~clarity)+
scale_color_brewer(type='div')